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O Abstract 
(N 

^ We show how the expectation-maximization (EM) algorithm can be applied exactly 

D for the fitting of mixtures of general multivariate skew t (MST) distributions, eliminating 

C/} the need for computationally expensive Monte Carlo estimation. Finite mixtures of MST 

distributions have proven to be useful in modelling heterogeneous data with asymmetric 
and heavy tail behaviour. Recently, they have been exploited as an effective tool for mod- 
i— i elling flow cytometric data. However, without restrictions on the the characterizations 

of the component skew ^distributions, Monte Carlo methods have been used to fit these 
models. In this paper, we show how the EM algorithm can be implemented for the itera- 
tive computation of the maximum likelihood estimates of the model parameters without 
resorting to Monte Carlo methods for mixtures with unrestricted MST components. The 
fast calculation of semi-infinite integrals on the E-step of the EM algorithm is effected by 
noting that they can be put in the form of moments of the truncated multivariate non- 
central i-distribution, which subsequently can be expressed in terms of the non-truncated 
form of the central ^-distribution function for which fast algorithms are available. We 
demonstrate the usefulness of the proposed methodology by some applications to three 
real data sets. 

ON 

1 Introduction 

• • 

Finite mixture distributions have become increasingly popular in the modelling and analysis 
^ of data due to their flexibility. This use of finite mixture distributions to model heterogeneous 
data has undergone intensive development in the past decades, as witnessed by the numer- 
ous applications in various scientific fields such as bioinformatics, cluster analysis, genetics, 
information processing, medicine, and pattern recognition. Comprehensive surveys on mixture 
models and their applications can be found, for example, in the monographs by Everitt and 
Hand (1981), Titterington, Smith, and Markov (1985), Lindsay (1995), McLachlan and Basford 
(1988), and McLachlan and Peel (2000), among others; see also the papers by Banfield and 
Raftery (1993) and Fraley and Raftery (1999). 

Mixtures of multivariate t-distributions, as proposed by McLachlan and Peel (1998, 2000), 
provide extra flexibility over normal mixtures. The thickness of tails can be regulated by an 
additional parameter - the degrees of freedom, thus enabling it to accommodate outliers better 
than normal distributions. However, in many practical problems, the data often involve obser- 
vations whose distributions are highly asymmetric as well as having longer tails than the normal, 
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for example, datasets from flow cytometry (Pyne et al., 2009). Azzalini (1985) introduced the 
so-called skew-normal (SN) distribution for modelling symmetry in data sets. Following the 
development of the SN and skew t- mixture models by Lin, Lee, and Yen (2007), and Lin, Lee, 
and Hsieh (2007), respectively, Basso et al. (2010) studied a class of mixture models where the 
components densities are scale mixtures of skew-normal distributions introduced by Branco and 
Dey (2001), which include the classical skew-normal and skew t-distributions as special cases. 
Recently, Cabral, Lachos, and Prates (2012) have extended the work of Basso et al. (2010) to 
the multivariate case. 

In a study of automated flow cytometry analysis, Pyne et al. (2009) proposed a finite mix- 
ture of multivariate skew t-distributions based on a 'restricted' variant of the skew t-distribution 
introduced by Sahu, Dey, and Branco (2003). Lin (2010) considered a similar approach, but 
working with the original (unrestricted) characterization by Sahu et al. (2003). However, with 
this more general formulation, maximum likelihood (ML) estimation via the EM algorithm 
(Dempster, Laird, and Rubin, 1977) can no longer be implemented in closed form due to the 
intractability of some of the conditional expectations involved on the E-step. To work around 
this, Lin (2010) proposed a Monte Carlo (MC) version of the E-step. One potential draw- 
back of this approach is that the model fitting procedure relies on MC estimates which can be 
computationally expensive. 

In this paper, we show how the EM algorithm can be implemented exactly to calculate the 
ML estimates of the parameters for the (unrestricted) multivariate skew t-mixture model, based 
on analytically reduced expressions for the conditional expectations, suitable for numerical 
evaluation using readily available software. A key factor in being able to compute the integrals 
quickly by numerical means is the recognition that they can be expressed as moments of a 
truncated multivariate non-central t-distribution, which in turn can be expressed in terms of 
the distribution function of a (non-truncated) multivariate central t-random vector, for which 
fast programs already exist. We show that the proposed algorithm is highly efficient compared 
to the version with a MC E-step. It produces highly accurate results for which, if MC were to 
achieve comparable accuracy, a large number of draws would be necessary. 

The remainder of the paper is organized as follows. In Section[2j for the sake of completeness, 
we include a brief description of the multivariate skew t-distribution (MST) used for defining 
the multivariate skew t-mixture model. We also describe the truncated t-distribution in the 
multivariate case, critical for the swift evaluation of the integrals on the E-step occurring in 
the calculation of some of the conditional expectations. Section [3] presents the development 
of an EM algorithm for obtaining ML estimates for the MST distribution. In the following 
section, the finite mixture of MST (FM-MST) distributions is defined. Section [5] presents an 
implementation of the EM algorithm to the fitting of the FM-MST model. An approximation to 
the observed information matrix is discussed in Section [6} Finally, we present some applications 
of the proposed methodology in Section [7) 

2 Preliminaries 

We begin by defining the multivariate skew t-distribution and briefly describing some related 
properties. Some alternative versions of the distribution are also discussed. Next, we introduce 
the truncated multivariate t-distribution and provide some formulas for computing its moments. 
These expressions are crucial for the swift evaluation of the conditional expectations on the E- 
step to be discussed in the next section. 
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2.1 The Multivariate Skew ^-Distribution 



Following Sahu et al. (2003), a random vector Y is said to follow a p-dimensional skew t- 
distribution with p x 1 location vector fi, p x p scale matrix 53, p x 1 skewness vector 8, and 
scalar degrees of freedom u, if its density is given by 

f p (y; fi, 53, <5, u) = 2%,, (y; ^ Q) T p , u+P (y*; 0, A) , (1) 

where 

A = diag(S), 
n = £ + AA T , 



2/ 



v + p 



u + d( y y 

q = A T n- 1 (y-/x), 
d(y) = (y-n) T n- 1 (y- f i), 
A — I p - A T fl- 1 A. 

Here the operator diag(<5) denotes a diagonal matrix with diagonal elements specifed by the 
vector 5. Also, we let t PjU (.; /a, 53) be the p-dimensional t-distribution with location vector fx, 
scale matrix 53, and degrees of freedom u, and T PjV {.; fi, 53) the corresponding (cumulative) 
distribution function. The notation Y ~ ST PjU (/j,, 53, 5) will be used. Note that when 5 = 0, 
([TJ reduces to the symmetric t-density t p>1/ (y\ fi, 53). Also, when v — » oo, we obtain the skew 
normal distribution. 

The MST distribution admits a convenient hierarchical form, 

Y\u,w ~ JV p (/i + A«,iS), 

J7 | w ~ ffJVp ^0, ^I p 

(V V 

W ~ gamma y-, - 

(2) 

where I p is the p x p identity matrix, Nk{fJ>, 53) denotes the multivariate normal distribution 
with mean fj, and covariance matrix 53, HN p (0, 53) represents the p-dimensional half-normal 
distribution with mean and scale matrix 53, and gamma(a, 0) is the Gamma distribution 
with mean a/f3. 

We observe from ^ that the MST distribution (fij) has the following stochastic representa- 
tion. Suppose that conditional on the value w of the gamma random variable W, 

^•)~ M ((S)-(^ v-))- 

where I p denotes the px p identity matrix, denotes the zero vector of appropriate dimension, 
and Uq is a p-dimensional random vector. Then 

Y = A\U\ + U (4) 

has the multivariate skew t-distribution density (IT]). In the above, \U\ denotes the vector 
whose ith element is the magnitude of the ith element of the vector U. It is worth noting 
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that, although also known as the multivariate skew t-distribution, the versions considered by 
Azzalini and Dalla Valle (1996), Gupta (2003), and Lachos, Ghosh, and Arellano- Valle (2010), 
among others, are different from ([I]). These versions are simpler in that the skew t-density is 
defined in terms involving only the univariate t-distribution function instead of the multivariate 
form of the latter as used in (JT1). Recently, Pyne et al. (2009) proposed a simplified version of 
the skew t-density given by ([!]) by replacing the term A \ U\ in Q by the term S \ U\, where U 
is a univariate standard normal random variable, leading to the reduced skew t-density: 

2%, v (y; f j,,V)T ljU+p (y* 1 ;0,l) , (5) 

where y\ = [(u + p) / (v + d (y))} 5 <5 T f2 _1 <5. One immediate consequence of this type of 'sim- 
plification' is that the correlation structure of the original symmetric model is affected by the 
introduction of skewness, whereas for ([!]) the correlation structure remains the same, as noted 
in Arellano- Valle, Bolfarine, and Lachos (2007). Nevertheless, one major advantage of having 
simplified forms like ^ is that calculations on the E-step can be expressed in closed form. 
However, the form of skewness is limited in these characterizations. Here, we extend their 
approach to the more general form of the skew t-density as proposed by Sahu et al. (2003). 



2.2 The truncated multivariate ^-distribution 

Let X be a p-dimensional random variable having a multivariate t-distribution with location 
vector ^t, scale matrix S, and v degrees of freedom. Truncating x to the hyperplane region 
A = {x < a, a G W J }, where x < a means each element Xj = (x)i is less than or equal to 
cii = (a)i for i = 1, . . . ,p, results in a right-truncated t-distribution whose density is given by 

f A (x; fx, £, p) = T~l (a; /a, S) V i x 'i A*, s , f) , x e A. (6) 

For a random vector X with density (JgJ) , we write X ~ tt PjU (n, S; A). For our purposes, 
we will be concerned with the first two moments of X, specifically E(X) and E(XX T ). 
Explicit formulas for the truncated central t-distribution in the univariate case tt\^ v (0,cr 2 ;A) 
were provided by O'Hagan (1973), who expressed the moments in terms of the non-truncated 
t-distribution. The multivariate case was studied in O'Hagan (1976), but still considering the 
central case only. We will generalize these results to the multivariate non-central case here. 

Before presenting the expressions, it will be convenient to introduce some notation. Let x 
be a vector, where X; denotes the ith element and Xij is a two-dimensional vector with elements 
Xi and Xj. Also, x^i and a?_jj represents the (p— 1) and (p — 2) -dimensional vector, respectively, 
with the corresponding elements removed. For a matrix X , x^ denotes the r/'th element, and 
X^ defines the 2x2 matrix consisting of the elements Xa, x^, Xji and Xjj. X_i is created 
by removing the ith row and column from X . Similarly, X_^ is the (p — 2) square matrix 
resulting from the removal of the ith and jth row and column from X. Lastly, Xnj\ is the ith 
and jth column of X with the elements of X^ removed, yielding a (p — 2) x 2 matrix. We now 
proceed to the expressions for the first two moments of X . 

With some effort, one can show that the first moment of ^ is 

E(X) = /^-cT 1 !^ 

= AA (7) 
where c = T v v (a — n; 0, S), and £ is a p x 1 vector with elements 

\u + a u {ai-fii) 2 ; r (|) v 2 
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for i = 1, . . . , p, and where 

a* = (a-i - - (a_i - tr^E^ 

and 



The second moment is given by 

E(XX T ) = fjifj? + w* T + n*n T - c^EJfS 

+^ fe) ^ (a - M ; 0, fe) S) £, (8) 

where H is & p x p matrix with off-diagonal elements 



and diagonal elements, 



and 



v* = v + (o« - in?? £»/ (a„ - , 



It is worth noting that evaluation of the expressions ^ and ^ rely on algorithms for 
computing the multivariate central t-distribution function for which highly efficient procedures 
are readily available in many statistical packages. For example, an implementation of Genz's 
algorithm (Genz and Bretz, 2002, Kotz and Nadarajah, 2004) is provided by the mvtnorm 
package available from the R website. 

3 ML Estimation for the MST Distribution 

In this section, we describe an EM algorithm for the ML estimation of the MST distribution 
specified by (1). To apply the EM algorithm, the observed data vector y = [yj, . . . ,y^) is 
regarded as incomplete, and we introduce two latent variables denoted by u and w, as defined 
by ([2]). We let be the parameter containing the elements of the location parameter fj,, the 
distinct elements of the scale matrix S, the elements of the skew parameter 8, and the degrees 
of freedom v. It follows that the complete-data log-likelihood function for 6 is given by 

L c (6;y,u,w) = K - fnlog |S| - nlogT (\v) + |wlog (\v) 

-\w {d(y) + (u-q) T \- 1 (u-q)) 

+ Qz/ + p-l)logH, (9) 
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where K does not depend on 0. 

The implementation of the EM algorithm requires alternating repeatedly the E- and M- 
steps until convergence in the case where the sequence of the log likelihood values L{0^) is 
bounded above. Here 0^ denotes the value of after the kth iteration. 

On the (k + l)th iteration, the E-step requires the calculation of the conditional expectation 
of the complete-data log likelihood given the observed data y, using the current estimate 0^ 
for 0. That is, we have to calculate the so-called Q- function defined by 

Q(0; 0^) = E 0ik) {L c (0; y, u, w) | y} , (10) 

where E (k) denotes the expectation operator, using 0^ for 0. This, in effect, requires the 
calculation of the conditional expectations 

eg = E eW {\og{Wj) \yj}, 
eg = E 9(k) {Wj | Vj ) , 
45 = E e^ {WjUj | Vj ) , 
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( 3 = E e^ (WjUjUJ | Vj ) . 



Note that the Q-function does not admit a closed form expression for this problem, due to 
the conditional expectations eg, eg, and eg not being able to be evaluated in closed form. 

(k) 

Concerning the calculation of the expectation e\ j, the conditional density of Wj given y^ 
is given by 

f ^ i *>) = — 7 L L\ - ( n ) 

where 



*<k) 




and is the zero vector of appropriate dimension. 

The conditional expectation E e ( k ) \\og(Wj) \ y^] can be reduced to 



(k) ( v [k) + V 



T , t , f« W / "<*>+P+2 ■ Q A(fc)\ 

Jp,„(fc)+p+2 [Vj y z,W+dW( y .)' U ' 7V / 



6 



where the last term S is given by 



s 



v 



if, 



j/OO _|_ p 



7/ 



(fc) 



P 



T , ( n W I ^ (fc) +P+2 ~ . Q A( fc ) 



u( k ) + d( k )(y j ) 
[tt (i/W+p)] 



(*) 



0,A (fc) 



|AH 



,(*0 



,0)- 



and S'j*- is an integral given by 
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(k) 

1,3 



0,A« 



(fc) 



(13) 



% 1 + 



1 + 



^ A (*)- 



,0) 



ds\ds2 • • ■ dsp } 



(14) 



and ijj(-) denotes the Digamma function. 

Combining (12) and (13), efi- can be reduced to 



,(*) 



,(*) 



+ P - log 



!/(*) +d(*)(y ,) 



-;0,aW)5{5. 



(15) 



We note that the term 5 1 will be very small in practice since it would be zero if we adopted 
a one-step late (OSL) EM algorithm (Green, 1990).. In which case, there would be no need to 
calculate the multiple integral Sf • in (13). Hence then, e[ fc - can be approximated as 



,(*) 



_|_ p 



(k) +p+2 



(fc) 



0,A (fc) 



!/(*) + d(*)(y i ) 



log 



T p ,, W+p (?/; (fc) ;0,A( fc ) 



j/Cfe) _|_ p 



_|_ p 



(16) 



However, in computing e\j in the examples to follow, we did not take S{j to be zero but 
calculated it by numerical integration. 

It can be easily shown that e^l can be written in closed form (see, for example, Lin (2010)), 
given by 



_|_ p 



v( k ) + d( k )( yj 



v 



0,A (fc) 



(17) 
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To obtain e^j and ejfj, first note that the joint density of y j; Uj, and Wj is given by 



,(*) 



(*) 



2 



,(*0 



Using Bayes' rule, the conditional density of it,- and lu^ given y ■ can be written as 



(18) 



wfT ( w 



f(uj,Wj | y J } 



J 31 2 ' 2 



(2tt 



A (fc) 



,(*) 



0,A (fc) 



From (19), standard conditional expectation calculations yield 



,(*) 

'3,j 



z/( fc )+p+2 



(fc). n f ^+d( k Hy,) \ A ( fc ) 



i/W+p+2 



(19) 



where S^J represents the expected value of a truncated p-dimensional t-variate X, which is 
distributed as, 



L< fe > / ^ (fc) +P + 2 a (fc). hj+ 



j y i/(*) + d(*)( yj .)'" 

That is, the random vector X is truncated to lie in the positive hyperplane M + . 
Analogously, e^j can be reduced to 



(20) 



,(*) 

-4,j 



(*). n A ( fc) 



a(k) 



(21) 



where S^j represents the second moment of X. The truncated moments and can be 
swiftly evaluated using the expressions (m) and dsl) in Section 2.2 



3.1 M-step 



On the (k + l)th iteration, the M-step consists of the maximization of the Q-function (10) with 
respect to 0. For easier computation, we employ the ECM extension of the EM algorithm, 
where the M-step is replaced by four conditional-maximization (CM)-steps, corresponding to 
the decomposition of 6 into four subvectors, 6 = (Oj, 62, O3, #4) T , where Q\ =11,62 = S, 9 3 is 
the vector containing the distinct elements of S, and #4 = v. To compute ^ k+l \ we maximize 
Q(H,e { 2 k \e { 3 k \ei k) ;0 {k) ) with respect to fi, and to compute 6 {k+1 \ we first update n to /^ fe+1 ) 
and then maximize g(^ fc+1 ),(5,^ fc) ,0f ) ;6l (fc) ) with respect to <5, and so on. 
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For clarity and distinction from diag (a) introduced earlier, we let DIAG(A) denote the 
operator that produces a vector by extracting the diagonal elements of A. Simple algebraic 
manipulations lead to the following closed form expressions for ^ k+l \ ~E^ k+1 \ and A^ fc+1 - ) , 



and 



(fc+i) 



A* 



S (k+i) = DIAG 



V n \e {k) v -A {k) e (k) 

1^3=1 e 2,jVj e 3,j 



y^n (fc) 
2^=1 e 2,j 



n 




n 




E - 4f 






1 


.i=i 




.5=1 





1 n 

S (fc+i) = iy^ r A (fc+i) e4;J (fc) T A (fc+i) T 
i=i 



+ fc-/^ (fe+1) ) fc"^ +1) ) T e^ 
-A^eg (y, - 



(22) 
(23) 



(24) 



where A (/c+1) = diag (<5 (fc+1) ' 

An updated estimate of the degrees of freedom z/ fc+1 ) is obtained by solving the equation 



log 



,(fc+i) 



j=i 



(25) 



In summary, the ECM algorithm proceeds as follows on the (k + l)th iteration: 

E-step: Given 6 = 6^ k \ compute the four conditional expectations e^j, e^j, e^j and 
by using (15), (17), (19), and (21), respectively, for j = 1, . . . , n. 



M-step: Update ^ (fc+1) , S (k+1) , £ (fc+1) and by using (22), (23), and (24). Calculate ^ fe+1 ) 



by solving (25). 



4 The Multivariate Skew t-Mixture Model 

The probability density function (pdf) of a finite mixture of g multivariate skew t-components, 
using the notation above, is given by 



/ (y; *) = X! * h fp Vh, s ?m s h, Vh) 



(26) 



where f p (y; fi h , S^, Sh, Vh) denotes the ith MST component of the mixture model as defined by 
(fl|, with location parameter fjt h , scale matrix E^, skew parameter <5/i, and degrees of freedom 
Vfr. The mixing proportions n h satisfy n h > (h = 1, . . . , g) and Ylh=i n h = 1- We shall denote 
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the model defined by (26) by FM-MST (finite mixture of MST) distributions. Let ^ contain 
all the unknown parameters of the FM-MST model; that is, = (jri, . . . , 7T g -i, 0\ , . . . , 
where now 8h consists of the unknown parameters of the ith component density function. 

To formulate the estimation of the unknown parameters in the FM-MST model as an 
incomplete-data problem in the EM framework, a set of latent component labels Zj = (z\j, . . . ,z g j) 
1, . . . ,n) is introduced, where each element Zhj is a binary variable defined as 

J 1, if yj, belongs to component i, , . 

^ \ 0, otherwise, 

and Y^h=i z hj = 1 (j = l,---,n). Hence, the random vector Zj corresponding to Zj fol- 
lows a multinomial distribution with one trial and cell probabilities 7Ti, . . . ,7r 3 ; that is, Zj ~ 
Mult g (l; iii, ... , Tig). It follows that the FM-MST model can be represented in the hierarchical 
form given by 

Yj | Uj, Wj, z hj = 1 ~ N p (^ h + A h Uj, —T, h 

V w j 

Uj I Wj, z hj = 1 ~ HN p ( 0, — I p J , 



I = 1 ~ gamma 



2 ' 2 

~ Mult g (l,7r), (28) 



where A h = diag (5h) and 7r = (7r 1; . . . , vr^) 7 ". 

5 ML Estimation for FM-MST Distributions 



From the hierarchical characterization (28) of the FM-MST distributions, the complete-data 
log-likelihood function is given by 



log L c (*) = log L lc (*) + log L 2c (*) + log L 3c (*) , (29) 



where 



g n 



h=l j=l 

£-(•) = E E «* [(y) '°g (y) + (y + f - "* W 



ft=l i=l 



^ ( w 3 



f 1 

L 3c (*) = ^ X] -P lo % ( 27r ) - 2 log |S/l1 

- y K W + ^ " qh ^ Ah 1 ^ " } ' (30) 
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and where 



A h = I p - AjO^Afc, 



It is clear from (29) that maximization of the Q-function of the complete-data log likelihood 
(McLachlan and Krishnan, 2008), 

Q(tf ;*(*)) = E^w {logL c (*) I y}, 

only requires maximization of the components functions Lh c (^f) separately (h = 1,2,3). The 
necessary conditional expectations involved in computing the Q-function with respect to (30) 
are, namely, 

T hf = E^ k) {Z hj | Vj }, 
e i% = £*c*){log(W,-) I Vj,Zhj = 1}, 
45, = E 9W {WjUj | yj,z hj = 1}, 
4% = E^wiWjUj | Vj ,z hj = 1}, 



e 



( fc ) _ Z? ... flX/.TT TT T 



E^WjUjUj \ Vj ,z hj = l}. (31) 



The posterior probability of membership of the hth component by using the current 
estimate for is given using Bayes' Theorem by 



r ( fc ) = : " v - : : : l 

h ? spg (fc) , ( . (fc) y (fc) e(fc) (fc)V v ; 



(fc) A (fc) ..(fc) 



Let = Qhj y w+ rf w| y The other four expectations have analogous expressions to 

their one-component counterpart given in Section [3} 
They are given by 

= ^i-p) -**[ "" '? ""' ) (33) 




Jfc) . ( "h ^ v I •' V "V " h yyj> j , 

e 2,hj — \ Ik) Jk) i \ I / *.ru\ ' \ 0q: ) 



= ( -i^d—, I ^> , _ ( yS" ; ; 0' K> ) sffi, (35) 

(fc) _ f "h \ T -l (*(*). n A (k)\ s (k) ( , } 
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where S[ k lj is a scalar defined by 



(k) 



'hi 



s T A 1 



h s 



logll 



-+p 



du, 



h 8 



(37) 



?( fe ) 5 



is a p x 1 vector whose rth element is 



o Jo 



i/£ fc) +p + 2 



(k) 



du, 



(38) 



and S% ^ ■ is a p x p matrix whose (r, s)th element is 



(fc) | ^ + «h 12/ j 



ii -/(I ./n " h \ \ v\ ' + p + 2 



(39) 



where, for convenience of notation, du is used to denote dui, du2, ■ ■ ■ , g?m p . 

It is important to note that and S^lj are related to the first two moments of a truncated 
p-dimensional t-variate X. More specifically, let 



X ~ ttp >Uh +p+2 ( 



the truncated t-distribution as defined by (|6]). Then 



z/[ fc) + p + 2 



and 



5. 



(*) 

3,ftj 



«S } ;o, 



which can be implicitly expressed in terms of the parameters , d h (yj), , i/£ using 

results (7) and (8). It is worth emphasizing that computation of e^j and depends on 
algorithms for evaluating the multivariate t-distribution function, for which fast procedures are 
available. 



In summary, the ECM algorithm is implemented as follows on the (k + l)th iteration: 



E-step: Given \I> = vj/^, compute rj^' using (32), and e^L> e^lj, e fhj-> anc ^ e fhj as de- 



scribed by (33), (34), (35), and (36) respectively, forH 



1, • • • ,9 and j = 1, . . . ,n. 



M-step: Update the estimate of by calculating for h = 1, . . . , g, the following estimates 
of the parameters in *S?, 
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and 



d[ k+1) = DIAG 



E 

Li=i 



T hj \ y , - m 



(fc+i)\ (k) T 

h I c 3,hj 





n 






T hj e 4,hj 






.3=1 





l(fc+l) 



E 



.(*) 

_(fc) ' hj 



En 
j=l T hj j=l 

(k+l)\ (fe) T A (fc+l) 



A (fc+1) ^W T A (fc+l) T 



, / / (*+i)V ( fc ) 



hj 



An update v^ 1 ^ of the degrees of freedom is obtained by solving iteratively the equation 



log 



v, 



(fc+i)' 



ffc+i) 



^pn [ (fe) / (fc) _ (fc) _ -I 

^i=i r fej l e 2,/ij e l,hj 1 



s-^n (fc) 

Z^i=i 'hj 



2 1 \ 2 

A program for implementing this EM algorithm has been written in R. 



6 The Empirical Information Matrix 

We consider an approximation to the asymptotic covariance matrix of the ML estimates using 
the inverse of the empirical information matrix (Basford et al., 1997). The empirical information 
matrix is given by 



(40) 



where s yy^ &j = E^, {d log L C j (\l>) /d^f \ y^] (j = 1, . . . , n) are the individual scores, con- 



sisting of 



l s J>i! • • • > Sj )7rg _ 1 , Sj t/il , . . . , -Sj !Mg , 



s j,A s ) • • • ) S j,Sg1 s j. 



mi " - 1 ^3^9 



S j,Va)- 



We let L C j denote the complete-data log likelihood formed from the single observation y . 



An estimate of the covariance matrix of is given by taking the inverse of (40). After some 
algebraic manipulations, one can show that the elements of s yyy, 4^ are given by the following 
explicit expressions: 



T hj _ 1 jjj_ 

3'^h „ „ ' 
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T hj t h e 2 ,ij (Vj -jj, h )-A 



h^3,ij > 



\r h j [(Vj ~ Aft) {Vj - Aft) T - (Vj - Aft) e lhj^h 
&he 3 ,hj (Vj - Aft) + A ft e3,ij (j/j- - Aft) + S h e% h j^h 

TTyE/diag {yj-fi h )el hj -k h e^ hj , 
\r hj [log + 1 + ei,/y - V (§J>a) - e 2 ,ftj] • 



7 Examples 

In this section, we fit the FM-MST model to three real data sets to demonstrate its usefulness 
in analyzing and clustering multivariate skewed data. In the first example, we focus on the 
flexibility of the FM-MST model in capturing the asymmetric shape of flow cytometric data. 
The next example illustrates the clustering capability of the model. In the final example, we 
demonstrate the computational efficiency of the proposed algorithm. 

7.1 Lymphoma Data 

We consider a subset of the T-cell phosphorylation data collected by Maier et al. (2007). In 
the original data, blood samples from 30 subjects were stained with four fluorophore-labeled 
antibodies against CD4, CD45RA, SLP76(pY128), and ZAP70(pY292) before and after an 
anti-CD3 stimulation. In this example, we focus on a reduced subset of the data in two 
variables - CD4 and ZAP70. This bivariate sample (Figure [TJ is apparently bimodal and 
exhibits asymmetric pattern. Hence we fit a two-component FM-MST model to the data. 
More specifically, the fitted model can be written as 



For comparison, we include the fitting of a two-component mixture of skew t-distributions 
from the Skew-normal Independent (SNI) family (Lachos, Ghosh, and Arellano- Valle, (2010)), 
hereafter named the FM-SNI-ST model. The estimated FM-SNI-ST density can be computed 
from the R package mixsmsn (Cabral, Lachos, and Prates, (2012)). Note that the MST distri- 
bution is different to the SNI-ST distribution since the skewing function is not of dimension 
one. Moreover, under the FM-SNI-ST settings, the correlation structure of Y will also be 
dependent on the skewness parameter, whereas for the FM-MST distributions the covariance 
structure is not affected by 5. The contours of the fitted SNI-ST and MST component densities 
are depicted in Fig[T](b) and Fig[]Jc), respectively. To better visualize the shape of the fitted 
models, we display the estimated densities of each component instead of the mixture contours. 
It can be seen that the FM-MST model provides a noticeably better fit. From a clustering 
point of view, the FM-MST model also shows better performance as it is able to separate the 
two clusters correctly. Moreover, it adapts to the asymmetric shape of each cluster more ad- 
equately. Thus the superiority of FM-MST model is evident in dealing with asymmetric and 
heavily tailed data in this data set. 



h (Vj) *) = ni/a (Vj\ Mi, s i, <5i, i/i) + (1 - tti) f 2 (yp M 2 > s 2, S 2 , v 2 ) , 



where 
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Figure 1: Mixture modelling of a reduced subset of prephosphorylation T cell population. Bi- 
variate skew t-mixtures were fitted to the data restricted in two dimensions CD^5 and ZAP70. 
(a) Hue intensity plot of the Lymphoma data set; (b) the contours of the component densities 
in the fitted two-component skew t-mixture model FM-SNI-ST using the R package mixsrnsn; 
(c) the fitted component contours of the two-component FM-MST model. 



7.2 GvHD Data 

Our second example concerns a data set collected by Brinkman et al. (2007), where peripheral 
blood samples were collected weekly from patients following blood and bone marrow transplant. 
The original goal was to identify cellular signatures that can predict or assist in early detection 
of Graft versus Host Disease (GvHD), a common post-transplantation complication in which 
the recipient's bone marrow was attacked by the new donor material. Samples were stained 
with four fluorescence reagents: CD4 FITC, CD8/3 PE, CD3 PerCP, and CD8 APC. Hence we 
fit a 4-variate FM-MST model to a case sample with a population of 13773 cells. The data set 
is shown in Figure [2j where cells are displayed in five different colours according to a manual 
expert clustering into five clusters. In addition, we include the results for the FM-SNI-ST 
model and a restricted version of the FM-MST model introduced in Section 2.1 (equation [5]), 
hereafter denoted by FM-RMST. 

We compare the performance of the three models FM-MST, FM-SNI-ST, and FM-RMST in 
assigning cells to the expert clusters. Manual gating suggests there are five clusters in this case 
sample. Hence we applied the algorithm for the fitting of each model with g predefined as 5. 
For a fair comparison, we started the three algorithms using the same initial values. The initial 
clustering is based on /c-means.The degrees of freedom are set to be identical for all components 
for the first iteration and assigned a relatively large value. A similar strategy was described in 
Lin (2010). 

To assess the performance of these three algorithms, we take the manual expert clustering 
as being the 'true' class membership and we calculated the error rate of classification against 
this benchmark result with dead cells removed, measured by choosing among the possible 
permutations of class labels the one that gives the highest value. 

As anticipated, the optimal clustering result was given by the FM-MST model. It achieved 
the lowest misclassification rate. The FM-SNI-ST model has a higher number of misallocations. 
The FM-RMST model has a disappointing performance in terms of clustering. Its error rate 
is almost double that of its competitors. It is worth pointing out that both the FM-MST and 
FM-RMST models have 99 free parameters, while the FM-SNI-ST model has 95 parameters. 
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Figure 2: GvHD data set: Expert manual clustering of a population of 13773 cells stained with 
four fluorescence reagents - CD4 FITC (FL1-H), CD8/3 PE (FL2.H), CD3 PerCP (FL3.H) 
and CD8 APC (FL4.H). 

Table 1: Clustering error rates for various multivariate skew t mixture models on the GvHD 
data set. 



Model 


Error rate 


Number of free parameters 


FM-MST 


0.10807 


99 


FM-SNI-ST 


0.13078 


95 


FM-RMST 


0.20700 


99 



It is evident that these two restricted models have inferior performance. This reveals some 
evidence of the extra flexibility offered by the more general FM-MST model. 

7.3 AIS Data 

We now illustrate the computational efficiency of our exact implementation of the E-step of the 
EM algorithm as in Section [5j where functions not in closed form are evaluated numerically. 
We denote this version of the EM algorithm with the exact E-step as EM-exact. In addition, 
we consider the EM alternative with a Monte Carlo (MC) E-step as given by Lin (2010), 
which is denoted by EM-MC. Since both models are based on the same characterization of the 
multivariate skew t-distribution defined by Sahu et al. (2003), it is appropriate to compare their 
computation time. We assess their time performance on the well-analyzed Australian Institute 
of Sport (AIS) data, which consists of p = 13 measurements made on n = 202 athletes. As in 
Lin (2010), we limit this illustration to a bivariate subset of two variables - body mass index 
(BMI) and the percentage of body fat (Bfat). As noted by Lin (2010), these data are apparently 
bimodal. Hence a two-component mixture model is fitted to the data set. 

A summary of the results are listed in Table [2} Also reported there are the values of the log- 
likelihood, the Akaike information criterion (AIC) (Akaike, 1974) and the Bayesian information 
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Table 2: Computation time and clustering error rates for two different implementations of 
the EM algorithm for the multivariate skew t mixture models on the AIS data set. For EM- 
exact, the E-step is implemented exactly as described in Section [5} As an alternative, the EM 
algorithm was implemented with a Monte Carlo E-step, EM-MC, as in Lin (2010). Time is 
measured in seconds. 



Model 


EM-exact 


EM-MC 


Component 


1 


2 


1 


2 


7T 


0.58 


0.42 


0.59 


0.41 




19.95 


22.55 


19.89 


22.47 




15.82 


7.25 


15.50 


7.30 




3.83 


3.30 


2.96 


3.23 


^i,12 


6.19 


1.35 


6.17 


1.34 




25.60 


2.15 


25.80 


2.14 


Sn 


3.55 


0.64 


2.72 


0.71 




1.82 


1.23 


2.22 


1.13 


V 


17.65 


25.92 


23.98 


25.93 


L{9) 


-1075 


1.937 


-108< 1 


3.066 


AIC 


2191.837 


2207.956 


BIC 


2248.114 


2264.197 


error rate 


0.08910891 


0.08910891 


time 


45.49 


349.9 



criterion (BIC) (Schwarz, 1978) defined by 

AIC = 2m - 2L (*) and BIC = mlogn - 2L (*) , (41) 

respectively, where L(^Sf) is the value of the log likelihood at m is the number of free 
parameters, and n is the sample size. Models with smaller AIC and BIC values are preferred 
when comparing different fitted results. The best value from each criterion are highlighted in 
bold font in Table [2j For this illustration, the EM-MC E-step is undertaken with 50 random 
draws as recommended by Lin (2010). The gender of each individual in this data set is also 
recorded, thus enabling us to evaluate the error rate of binary classification for the two methods. 
Both algorithms were terminated after the 23r<i iteration, yielding the same misclassification 
rate of 0.0891. 

Not surprisingly, the model selection criteria favour the EM-exact algorithm. Not only 
did it achieve lower AIC and BIC values, the computation time is remarkably lower than its 
competitor. It is more than seven times faster than the EM-MC alternative. 

8 Computation Time and Accuracy for E-step 

We now proceed to two interesting experiments for evaluating the computational cost and ac- 
curacy of using the EM-exact and EM-MC algorithms on high-dimensional data. As pointed 
out previously, the main computational cost for EM-exact is evaluating the multivariate t- 
distribution function. Calculation of the first two moments of a p-variate truncated t-distribution 
requires the evaluation of two T p (-) functions, p evaluations of T p _ 1 (-), and \p{p — 1) evalua- 
tions of T p _ 2 (-). Hence, the computation time will increase substantially with the number of 
dimensions. However, with the EM-exact algorithm, accuracy can be compromised for time. 
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Figure 3: Comparison of performance of the EM-MC and EM-exact methods on a subset of 
100 samples from the Brain Tumor data. Green line: EM-MC with 500 draws, red line: EM- 
MC with 100 draws, blue line: EM-MC with 50 draws, black line: EM-exact. (a) Typical 
computation time for E-step on a sample of 100 data in various dimensions, (b) Total absolute 
error of E-step for one data point. 

We sampled 100 data from a Brain Tumor dataset supplied by Geoff Osborne from the 
Queensland Brain Institute at the University of Queensland. In both experiments we varied 
the dimensionality p of the sample. The graph in Figure [3|a) shows the typical CPU time per 
each E-step iteration for various dimensions p of the data; EM-MC (m) represents the EM-MC 
algorithm with m random draws using the Gibbs sampling approach described in Lin (2010). It 
is worth noting that in both experiments EM-exact is evaluated with a default tolerance of at 
least 10 -6 . As seen in Figure[3j EM-exact is the fastest among the four versions of the E-step for 
low dimensions. For example, at p = 2, EM-exact at least eight times faster than EM-MC(50). 
It is important to note that although EM-MC(50) is slightly faster than EM-exact at higher 
dimensions, EM-exact produces results to a significantly higher accuracy, while EM-MC requires 
a large number of draws to achieve comparable results. We note that in our simulations, for 
example, at p = 7, 50 draws is insufficient to achieve acceptable estimates. Preliminary results 
suggests that at least 500 draws is required to generate reasonable approximations when p is 
greater than 6. In this case, EM-exact is at least 18 times quicker. Furthermore, EM-exact also 
has an additional advantage over the EM-MC alternative in that its results are reproducible. 

To compare the accuracy of the EM-exact and EM-MC algorithms, we compute the total 
absolute error against the baseline EM-exact with a maximum tolerance of 10 -18 . For each of 
the EM-MC (m) algorithms, the average total absolute error of 100 trials is used. For EM-exact, 
the default tolerance is set to 10 -6 . The results are shown in Figure [3|b). Not surprisingly, 
the absolute error of the EM-MC algorithm is significantly higher than that of the EM-exact 
algorithm. It can be observed that the absolute error is very high even for EM-MC(500). At 
p = 10, for example, EM-exact is at least 30000 times more accurate and takes less than half 
the time required for EM-MC(500). 

It is important to emphasize that as the dimension p of the data increases, EM-MC re- 
quires considerably more draws to provide a comparable (and acceptable) level of accuracy as 
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EM-exact, which can be computationally intensive. Hence we advocate the use of EM-exact, 
especially for applications involving high dimensional data. 

9 Concluding Remarks 

We have described an exact EM algorithm for evaluating the parameters of a general multi- 
variate skew t-mixture model. This model has a more general characterization than various 
alternative versions of the skew t-distribution available in the literature and hence offers greater 
flexibility in capturing the asymmetric shape of skewed data. 

Our proposed method is based on reduced analytical expressions for the E-step conditional 
expectations, which can be formulated in terms of the first and second moments of a multivariate 
truncated t-distribution. The latter can then be expressed further in terms of the distribution 
function of the multivariate central t-distribution for which fast algorithms capable of producing 
highly accurate results already exist. It is demonstrated to have a marked advantage over the 
EM algorithm with a Monte Carlo E-step. To achieve comparable accuracy to that of the EM 
algorithm with the E-step implemented using the above numerical approach, the version of the 
algorithm with a Monte Carlo E-step would require a large number of draws, which would be 
computationally expensive. 
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